Fix CellPipe topology FQCN naming#4835
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4835 +/- ##
==========================================
+ Coverage 56.97% 57.05% +0.07%
==========================================
Files 969 969
Lines 92285 92322 +37
==========================================
+ Hits 52578 52671 +93
+ Misses 39707 39651 -56
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Greptile SummaryThis PR fixes CellPipe FQCN naming so that a pipe cell's FQCN parent matches the cell it physically connects to, resolving
Confidence Score: 5/5Safe to merge; the fix is self-contained, well-tested, and the unreleased hierarchical naming scheme (#4801) carries no backward-compatibility burden. The topology naming change is logically correct: each connection scenario (root, own CP, relay, CP-behind-relay, missing parent) is covered by dedicated parametrized tests verifying both happy paths and all ValueError cases. The No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["_cell_fqcn(mode, site, token, parent_fqcn)"] --> V1{token empty?}
V1 -->|yes| E1["ValueError: token must be non-empty"]
V1 -->|no| V2{token starts with 'alias-'?}
V2 -->|yes| E2["ValueError: 'alias-' prefix reserved"]
V2 -->|no| LEAF["cell_name = cellpipe-{token}_{mode}"]
LEAF --> P1{parent_fqcn == ROOT_SERVER?}
P1 -->|yes| R1["prefix = site_name\n(dotted token allowed; routes via root fall-through)"]
P1 -->|no| P2{parent_fqcn ends with site_name?}
P2 -->|yes| V3{'.' in token?}
V3 -->|yes| E3["ValueError: '.' splits FQCN segments"]
V3 -->|no| R2["prefix = parent_fqcn\n(plain leaf under own CP)"]
P2 -->|no| P3{parent_fqcn non-empty?}
P3 -->|yes| V4{"'_' or '.' in token?"}
V4 -->|yes| E4["ValueError: must not contain '_' or '.'"]
V4 -->|no| R3["prefix = parent_fqcn\ncell_name = cellpipe-alias-{site}_{token}_{mode}"]
P3 -->|no| R4["prefix = site_name\n(fallback: warn + root routing)"]
R1 --> OUT["FQCN.join([prefix, cell_name])"]
R2 --> OUT
R3 --> OUT
R4 --> OUT
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A["_cell_fqcn(mode, site, token, parent_fqcn)"] --> V1{token empty?}
V1 -->|yes| E1["ValueError: token must be non-empty"]
V1 -->|no| V2{token starts with 'alias-'?}
V2 -->|yes| E2["ValueError: 'alias-' prefix reserved"]
V2 -->|no| LEAF["cell_name = cellpipe-{token}_{mode}"]
LEAF --> P1{parent_fqcn == ROOT_SERVER?}
P1 -->|yes| R1["prefix = site_name\n(dotted token allowed; routes via root fall-through)"]
P1 -->|no| P2{parent_fqcn ends with site_name?}
P2 -->|yes| V3{'.' in token?}
V3 -->|yes| E3["ValueError: '.' splits FQCN segments"]
V3 -->|no| R2["prefix = parent_fqcn\n(plain leaf under own CP)"]
P2 -->|no| P3{parent_fqcn non-empty?}
P3 -->|yes| V4{"'_' or '.' in token?"}
V4 -->|yes| E4["ValueError: must not contain '_' or '.'"]
V4 -->|no| R3["prefix = parent_fqcn\ncell_name = cellpipe-alias-{site}_{token}_{mode}"]
P3 -->|no| R4["prefix = site_name\n(fallback: warn + root routing)"]
R1 --> OUT["FQCN.join([prefix, cell_name])"]
R2 --> OUT
R3 --> OUT
R4 --> OUT
Reviews (14): Last reviewed commit: "Use explicit cellpipe- and cellpipe-alia..." | Re-trigger Greptile |
chesterxgchen
left a comment
There was a problem hiding this comment.
What changed:
- CellPipe FQCN naming changed from hierarchical runtime segments like:
site-1.job-123.active
to topology-aligned leaf names like:
site-1.job-123_active
See /private/tmp/nvflare-pr4835/nvflare/fuel/utils/pipe/cell_pipe.py:50. This makes the FQCN parent site-1, which is the cell the
subprocess actually connects to, instead of the unconnected pseudo-parent site-1.job-123.
-
Routing tests now cover the reported case: a pipe cell named site-1.job-123_active, connected only to site-1, can route to
server.job-123 through site-1. See /private/tmp/nvflare-pr4835/tests/unit_test/fuel/f3/cellnet/core_cell_routing_test.py:46. -
Identity/auth code was adjusted so these alias-style CellPipe stream cells still authenticate as the owning site, but only on the
stream channel and under the same parent. See /private/tmp/nvflare-pr4835/nvflare/private/fed/authenticator.py:306. -
It does not generally change CoreCell._try_find_ep() routing behavior, except comments. The fix is mainly “name the CellPipe cell so
the existing parent fallback works.”
Validation I ran on PR 4835:
90 passed
for:
tests/unit_test/fuel/utils/pipe/cell_pipe_test.py
tests/unit_test/fuel/f3/cellnet/core_cell_routing_test.py
tests/unit_test/fuel/f3/cellnet/identity_binding_test.py
tests/unit_test/private/fed/authenticator_test.py
Does it fix the reported problem?
Likely yes for the primary failure:
ext-process + streaming -> target_unreachable on server.
The new naming restores the parent lookup path described in your root cause. For site-1.job-123_active, the parent fallback is site-1,
and that is the connected client cell, so routing to server. should no longer die at TARGET_UNREACHABLE.
What it does not fix:
- The secondary logger bug is still there:
self.logger.log(f"exception from do_learn_task: ...")
in /private/tmp/nvflare-pr4835/nvflare/app_common/ccwf/client_ctl.py:464. That still should be self.logger.error(...).
- The “job ends FINISHED:COMPLETED after TASK_ABORTED” behavior is not addressed by this PR. If the CellPipe routing fix prevents the
abort, that symptom disappears for this case, but the controller semantics are unchanged for any future abort path.
My read: PR 4835 fixes the actual stream-routing regression, but it is incomplete relative to the full issue report because it leaves
the logger typo and abort/status behavior untouched. It also has unit coverage, not a full POC ext-process streaming regression test.
There was a problem hiding this comment.
@nvidianz before you merge, can you address my comments
|
@chesterxgchen Thanks for the detailed review — to your points (updated: all three are now addressed in this PR):
Please take another look when you get a chance. |
An exception from do_learn_task was only logged; the client never set the error in its status report, so the ccwf server controller never saw the failure and the job ended FINISHED:COMPLETED. Record the error so the next status report triggers system_panic on the server, ending the job as FINISHED:EXECUTION_EXCEPTION. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Follow-up on the two remaining review points — both are now part of this PR: 1. FINISHED:COMPLETED after TASK_ABORTED — addressed at the ccwf level in 8837e2c. 2. Ext-process streaming regression coverage — 44fa08d adds an
Ran on this branch (1 server, 2 clients): Remaining follow-up (separate PR): push error status reports to the server immediately via aux message instead of piggybacking on task pulls, closing the end-of-run race window where a recorded error might not be delivered. |
Groups the two existing ext-process Client API test cases (np_loop_cell_pipe and pt_large_model_pass_through) so the subprocess CellPipe -> server.<jobid> streaming routing path broken by the FQCN naming regression can be exercised on its own as a regression test. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Review: PR #4835 — Fix CellPipe topology FQCN naming The PR fixes NVBug 6371056 by collapsing the CellPipe token+mode into a single leaf segment (site-1._active instead of site-1..active), so the FQCN parent matches the physically connected cell and cross-family routing to server. no longer falls back to an unconnected parent. It rewrites the stream-alias check in authenticator.py to a parent-equality + owner-leaf rule, and updates routing/identity/auth test coverage accordingly. The core fix is sound — the two high-severity candidates my finders raised (peer-FQCN disagreement between pipe ends, relay+VIA_ROOT auth rejection) were both refuted on verification: pipe ends provably compute names from identical exported conn props, and the VIA_ROOT scenario is pre-existing on main and unreachable via ScriptRunner. Surviving findings (most severe first)
|
Address review findings on the alias naming: - The <owner>_<runtime_id>_(active|passive) grammar was independently encoded in cell_pipe.py (build), identity.py (parse) and authenticator.py (re-parse). Move it to fqcn.py as make_cell_pipe_alias/parse_cell_pipe_alias so the three sites cannot drift apart. - Fail fast on tokens that break the naming: "." always splits the name into extra FQCN segments; "_" breaks alias parsing when the cell connects through another cell. "_" stays allowed in non-alias forms since the simulator uses the "simulate_job" token. - Correct the _cell_fqcn comment: root-connected pipe cells do NOT have a connected FQCN parent; cross-reference the load-bearing routing fall-through in CoreCell._try_find_ep and its test. - Warn when conn props carry no parent FQCN instead of silently fabricating a topology-shaped name. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
@chesterxgchen Thanks — findings 1-4 are addressed in 67ef2db; 5 and 6 declined with rationale below. 1. Token validation — 2. Grammar triplication — the alias grammar now lives in one place: 3. Wrong comment — rewritten: the comment now states explicitly that root-connected pipe cells (simulator/root-url) have an unconnected FQCN parent and rely on the 4. Silent fallback — the missing-parent-FQCN branch now logs a warning so a misconfiguration that produces a correct-looking name is diagnosable. 5. Identical branches — declined as you anticipated: with the warning from (4), the 6. Test file naming — keeping Aside ( All affected unit suites pass (107 targeted + 164 across cellnet/pipe), style checks clean. |
|
Review: PR #4835 — Fix CellPipe topology FQCN naming Overview. The PR fixes the NVBug 6371056 routing regression from #4801 by collapsing the CellPipe token and mode into one leaf segment (site-1._active instead of site-1..active) so a subprocess cell's FQCN parent matches the cell it actually connects to, reintroduces a shared alias grammar (make_cell_pipe_alias/parse_cell_pipe_alias) for relay-connected pipes, and updates mTLS identity resolution and stream auth accordingly. It also bundles an unrelated ccwf fix that reports do_learn_task exceptions to the server. The core routing fix is sound — I verified (and rejected) candidates claiming peer-name divergence between pipe ends and broken relay stream auth; both ends of a pipe pair always derive the same parent from shared conn props, and the relay auth paths improve under this PR. Findings (most severe first):
Verified and dismissed (for the record): relay vs CP-behind-relay pipe ends computing different names (both ends always share conn props, so parents always match); the _ token ValueError behind relays (unreachable from every documented/in-tree path — standalone agents always take the ROOT_SERVER branch); VIA_ROOT stream auth for relay-registered clients (rejected identically on base — pre-existing, and this PR improves the supported VIA_CP/VIA_RELAY paths); and the ccwf abort-path error report (in-tree do_learn_task implementations return cleanly on abort, and workflow_done gates any post-abort report from reaching the server). The one item I'd treat as blocking is finding 1 — it's a hard regression of the documented secure 3rd-party integration path, with an executed repro. Findings 2–4 are act-on-but-minor. |
Address review findings on the CellPipe topology naming:
1. CellIdentityResolver only treats a leaf as a CellPipe alias when the
FQCN is a single segment (legacy sibling alias) or a direct child of
the local cell (a pipe connected through this cell, e.g. a relay).
Anywhere else the leaf is a plain <token>_<mode> segment whose token
may contain "_" (e.g. site-1.ext_trainer_active), and alias-parsing
it fabricated a wrong owner ("ext"), breaking mTLS identity
resolution for the documented secure external-trainer integration.
2. _cell_fqcn only rejects "." in tokens for CP/relay-connected pipes,
where extra FQCN segments would recreate an unconnected parent.
Root-connected cells route via the root fall-through regardless of
depth, so dotted agent ids keep working as they did before.
3. ClientSideController._do_learn uses the fl_ctx-aware log_exception
helper like its sibling exception paths.
4. Rename test_client_ctl.py to client_ctl_test.py to follow the
documented [module_name]_test.py convention.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
@chesterxgchen All four findings are addressed in cb0b917: 1. Alias parse breaking underscore tokens (blocking) — 2. Over-broad "." rejection — the fail-fast is now scoped to the CP/relay-connected branches, where extra segments would recreate the unconnected-parent bug. The ROOT_SERVER (and missing-parent) branches accept dotted tokens again and name the cell 3. 4. Test naming — renamed to All touched suites pass locally (identity_binding, cell_pipe, client_ctl, authenticator, core_cell_routing — 109 tests), plus black/isort/flake8 on the changed files. |
The configured token (e.g. "{JOB_ID}") can resolve to an empty string
when no job id is available, which named the cell "<site>._<mode>"
(e.g. site-2._active) and made all such pipes on a site collide on a
degenerate name. _cell_fqcn now substitutes a fixed "default" token so
the name stays well-formed (site-2.default_active,
relay-1.site-2_default_active behind a relay), and CellPipe warns when
the fallback is used. Both ends of a pipe pair derive their own and the
peer's name from the same inputs, so they agree on the fallback.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
PR #4835 — CellPipe topology FQCN naming: three-angle review Architect: sound, with concerns The core move is architecturally correct, not a workaround: in CellNet, the FQCN is the routing address, and PR #4801 broke the invariant "address hierarchy ≈ connection topology." Naming the subprocess cell site-1._active makes its FQCN parent the cell it's physically connected to, so routing works through unmodified generic logic — the PR touches core_cell.py only in comments. A plain revert of #4801 would not have fixed the bug (the pre-#4801 flat name is gen-1 and fails the same routing dead-end at core_cell.py:1209), so the "third naming scheme" is justified: it's the first scheme that satisfies routing and mTLS identity simultaneously. Concerns: the divergence class isn't eliminated — root-connected pipes still rely on a self-described "load-bearing" fall-through, so the invariant is conditional and a future cell type can silently trip the same gen>1 dead-end. The underscore now does double duty (separator inside a segment), contained by a right-anchored grammar and construction-time rejection seam is untested (below).Mixed-version compatibility is actually good — legacy flat aliases are still accepted Principal engineer: core fix is correct (verified, not assumed) The reviewer traced the failing and fixed paths end-to-end, confirmed the Non-blocking findings: (1) the new hard ValueErrors on dotted/underscored tokens are an upgrade break for custom FlareAgentWithCellPipe agent ids that previously worked — deliberate and loud, but release-note it; (2) a mixed-version CJ↔subprocess pair (training venv on older nvflare) fails with "peer FQCN mismatch" — inherent to any rename, second rename in a row for this cell, also a release-note item; Security engineer: overall risk LOW
Two low-severity residuals, both independently found by all three reviewers:
Consolidated recommendation Suggested follow-ups (none blocking): add a local_fqcn= identity-binding test for underscore-token pipe children; make the empty-token fallback unique or raising; add release notes for the new token restrictions and the CJ↔subprocess version-mismatch behavior; put a compat/naming-scheme note next to the grammar in fqcn.py; and consider a connected-ancestor last-resort in _try_find_ep as the deeper routing fix. The bundled client_ctl.py error-reporting fix is correct but separable — fine to keep, worth a mention in the PR description. |
… scheme - An empty CellPipe token now raises ValueError in all branches instead of silently falling back to a shared "default" name: an empty token cannot uniquely name the cell, and a generated per-process fallback is not possible because the two ends of a pipe pair derive each other's names independently. This restores the loud failure the pre-PR code had (invalid FQCN) with a clearer message. - Pin the CP-local alias seam in a test: a CP resolving its own underscore-token pipe child parses a fabricated owner (fail closed, non-mTLS in practice) - documented rather than special-cased since the name alone cannot distinguish it from a relay alias. - Add direct unit tests for make/parse_cell_pipe_alias and relay-shape routing tests for <relay>.<site>_<token>_<mode> cells. - Document the CellPipe naming-scheme history and mixed-version behavior next to the alias grammar in fqcn.py. - Release-note the new token restrictions and the CJ/subprocess version-mismatch behavior in flare_280.rst. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
@chesterxgchen Thanks for the three-angle review. The suggested follow-ups are addressed in 16916c5: 1. Empty-token fallback → now raising. Of your two options (unique or raising), a process-unique suffix is structurally impossible for CellPipe: the two ends of a pipe pair derive each other's cell names independently from the same 2. CP-local alias seam → pinned in a test. 3. Test gaps — added 4. Compat/naming-scheme note — the grammar block in 5. Release notes — 6. PR description — updated to call out the separable ccwf Deferred as a separate follow-up: the connected-ancestor last-resort in All cellnet/pipe/authenticator/ccwf suites pass (201 tests) plus black/isort/flake8 on the touched files. |
CellPipe leaves are now explicitly marked: - plain leaf "cellpipe-<token>_<mode>" for root/own-CP connections - alias leaf "cellpipe-alias-<owner>_<token>_<mode>" behind a relay This makes the alias grammar unambiguous instead of positionally gated: a plain leaf whose token contains "_" (e.g. cellpipe-ext_trainer_active) can never be misread as an alias, so identity resolution recognizes the marked alias at any depth (restoring owner resolution from distant cells, e.g. cert exchange through the server) and resolves any plain pipe leaf to the identity of the cell it is named under - which also fixes the CP-local seam where an underscore-token pipe child parsed to a fabricated owner. Stream auth accepts the bare legacy grammar only for whole-FQCN (pre-2.8 flat) origins and requires the alias marker at any depth. Tokens starting with "alias-" are rejected at construction so the plain namespace cannot collide with the alias namespace. Verified end-to-end with a POC FedAvg ext-process job (2 clients, 2 rounds, 109MB streamed model): pipes rendezvous under the new names and the job completes cleanly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Follow-up in 100e8e8: CellPipe leaves are now explicitly marked, per review discussion —
This eliminates the alias/plain ambiguity grammatically instead of containing it positionally:
Verified beyond unit tests (213 passing across cellnet/pipe/authenticator/ccwf, plus direct grammar tests): an end-to-end POC FedAvg run (2 clients, 2 rounds, ext-process ScriptRunner, 109 MB streamed model) completes cleanly with pipes rendezvousing under the new names ( |
Summary
site-1.<jobid>_activeinstead ofsite-1.<jobid>.active.server.<jobid>through the client CP.Root Cause
PR #4801 changed CellPipe cells to hierarchical names such as
site-1.<jobid>.active. The subprocess cell connects physically tosite-1, but its FQCN parent becamesite-1.<jobid>, which is not connected. Cross-family routing toserver.<jobid>then fell back to the missing FQCN parent and returnedTARGET_UNREACHABLE.Validation
~/nvflare-env/3.12/bin/python -m pytest tests/unit_test/fuel/utils/pipe/cell_pipe_test.py tests/unit_test/fuel/f3/cellnet/core_cell_routing_test.py tests/unit_test/fuel/f3/cellnet/identity_binding_test.py tests/unit_test/private/fed/authenticator_test.py -q~/nvflare-env/3.12/bin/python -m black --check ...on touched files~/nvflare-env/3.12/bin/python -m isort --check-only ...on touched files~/nvflare-env/3.12/bin/python -m flake8 ...on touched filesNote: repo-wide
PATH="$HOME/nvflare-env/3.12/bin:$PATH" ./runtest.sh -swas attempted, but failed on unrelated checked-inexamples/.../node_modules/.../flatted.pyfiles that black wants to reformat.NVBug: 6371056
Also included (separable)
ClientSideController._do_learnnow recordsReturnCode.EXECUTION_EXCEPTIONwhendo_learn_taskraises, so the job ends with an error status instead ofFINISHED:COMPLETED. This also fixes the brokenself.logger.log(msg)call in that handler (missing level argument).